Exploratory Analysis

We chose to model auto prices using the Autos data, which comes from a larger Kaggle dataset scraped from the German eBay site. The training data we received contains 2000 car listings with 20 variables each.

Understanding the data

Before we can model the relationship between price and the other variables, we need to understand the data in our training set. A summary of the dataset can be found below.

##               dateCrawled                                  name     
##  2016-03-23 17:46:12:   2   Audi_A4_Avant_1.9_TDI            :   5  
##  2016-03-28 16:57:13:   2   BMW_320i                         :   5  
##  2016-03-05 14:09:18:   1   Volkswagen_Passat_Variant_1.9_TDI:   4  
##  2016-03-05 14:14:33:   1   BMW_116i                         :   3  
##  2016-03-05 14:17:34:   1   BMW_318i                         :   3  
##  2016-03-05 14:24:20:   1   Citroën_C1_1.0_Advance           :   3  
##  (Other)            :1992   (Other)                          :1977  
##         seller       offerType        price            abtest    
##  gewerblich:   0   Angebot:2000   Min.   :     0   control: 968  
##  privat    :2000   Gesuch :   0   1st Qu.:  1150   test   :1032  
##                                   Median :  2950                 
##                                   Mean   :  6042                 
##                                   3rd Qu.:  7500                 
##                                   Max.   :123456                 
##                                                                  
##      vehicleType  yearOfRegistration      gearbox        powerPS    
##  limousine :490   Min.   :1934                : 128   Min.   :   0  
##  kleinwagen:412   1st Qu.:1999       automatik: 425   1st Qu.:  69  
##  kombi     :374   Median :2003       manuell  :1447   Median : 105  
##            :220   Mean   :2013                        Mean   : 117  
##  bus       :161   3rd Qu.:2009                        3rd Qu.: 150  
##  cabrio    :129   Max.   :8888                        Max.   :2331  
##  (Other)   :214                                                     
##      model        kilometer      monthOfRegistration    fuelType   
##  golf   : 158   Min.   :  5000   Min.   : 0.000      benzin :1212  
##  andere : 150   1st Qu.:100000   1st Qu.: 3.000      diesel : 541  
##         : 126   Median :150000   Median : 6.000             : 208  
##  3er    : 125   Mean   :124370   Mean   : 5.671      lpg    :  30  
##  polo   :  71   3rd Qu.:150000   3rd Qu.: 9.000      cng    :   4  
##  a4     :  62   Max.   :150000   Max.   :12.000      hybrid :   3  
##  (Other):1308                                        (Other):   2  
##            brand     notRepairedDamage              dateCreated  
##  volkswagen   :436       : 365         2016-04-02 00:00:00:  84  
##  bmw          :235   ja  : 181         2016-03-20 00:00:00:  80  
##  opel         :203   nein:1454         2016-03-14 00:00:00:  78  
##  audi         :192                     2016-03-19 00:00:00:  78  
##  mercedes_benz:188                     2016-03-07 00:00:00:  76  
##  ford         :127                     2016-04-03 00:00:00:  76  
##  (Other)      :619                     (Other)            :1528  
##   nrOfPictures   postalCode                   lastSeen   
##  Min.   :0     Min.   : 1069   2016-03-12 21:44:42:   2  
##  1st Qu.:0     1st Qu.:28326   2016-03-17 14:45:58:   2  
##  Median :0     Median :49514   2016-04-03 16:16:09:   2  
##  Mean   :0     Mean   :50491   2016-04-05 13:17:05:   2  
##  3rd Qu.:0     3rd Qu.:71588   2016-04-05 14:45:46:   2  
##  Max.   :0     Max.   :99974   2016-04-05 15:16:27:   2  
##                                (Other)            :1988

Immediately we can see that missing labels could be a problem. 220 training entries don’t have a vehicleType, 126 entries don’t have a model, 128 entries don’t have a gearbox listed, 208 entries are missing a fuelType, and 365 entries don’t have an answer to notRepairedDamage.

any(is.na(train))
## [1] FALSE
any(is.na(test))
## [1] FALSE

These missing labels aren’t treated as NA because their value is an empty string, i.e. “”. We should consider renaming these missing labels so we can interpret them in the models later. Missing or unknown values are also hidden in the data as 0s in powerPS (horsepower), monthOfRegistration, and price, and yearOfRegistration uses impossible year values (e.g., 8888) to mark unknown years, or perhaps cars that haven’t been registered yet. The 0-Euro prices could also be a seller’s trick to make a car appear first when potential buyers sort from lowest to highest price; such sellers tend to put the actual price in the description or require interested buyers to contact them. For example, the following entry is listed at 1 Euro, but its name translates to “We buy all Porsche 911 and Mercedes SL R107”, which suggests it isn’t a car for sale at all but someone looking to buy expensive cars.

train[which(row.names(train) == "125059"), ]
##                dateCrawled                                            name
## 125059 2016-03-07 21:41:17 Wir_kaufen_alle_porche_911_und_Mercedes_sl_r107
##        seller offerType price  abtest vehicleType yearOfRegistration
## 125059 privat   Angebot     1 control                           1960
##        gearbox powerPS model kilometer monthOfRegistration fuelType
## 125059               0    sl      5000                   0         
##                brand notRepairedDamage         dateCreated nrOfPictures
## 125059 mercedes_benz                   2016-03-07 00:00:00            0
##        postalCode            lastSeen
## 125059      70174 2016-03-08 02:46:32
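The empty-string label counts quoted above can also be computed directly rather than read off the summary. A minimal sketch using a toy data frame (the real columns are those shown in the summary, with 2000 rows):

```r
# Toy stand-in for the training data with two of the affected columns.
toy <- data.frame(
  gearbox  = c("manuell", "", "automatik", ""),
  fuelType = c("benzin", "diesel", "", "benzin"),
  stringsAsFactors = FALSE
)

# Count empty-string entries per column -- these are the hidden "missing" labels.
missing_counts <- colSums(toy == "")
missing_counts
```

Applied to the real training set, the same `colSums(train == "")` call would recover the 220/126/128/208/365 counts listed earlier.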

We should try to give the “” entries a label, like “unknown”. We might as well translate some of the levels as we go, too.

train$notRepairedDamage = as.factor(ifelse(train$notRepairedDamage == "ja", "yes", ifelse(train$notRepairedDamage == "nein", "no", "unknown")))
train$gearbox = as.factor(ifelse(train$gearbox == "automatik", "automatic", ifelse(train$gearbox == "manuell", "manual", "unknown")))

# These factors have too many levels for nested ifelse() calls, so instead we
# convert to character, replace the empty strings, and re-factor
vehicleType = as.character(train$vehicleType)
vehicleType = ifelse(vehicleType == "", "unknown", vehicleType)
train$vehicleType = as.factor(vehicleType)

fuelType = as.character(train$fuelType)
fuelType = ifelse(fuelType == "", "unknown", fuelType)
train$fuelType = as.factor(fuelType)

modelName = as.character(train$model)
modelName = ifelse(modelName == "", "unknown", modelName)
train$model = as.factor(modelName)

brands = as.character(train$brand)
brands = ifelse(brands == "", "unknown", brands)
train$brand = as.factor(brands)

summary(train)
##               dateCrawled                                  name     
##  2016-03-23 17:46:12:   2   Audi_A4_Avant_1.9_TDI            :   5  
##  2016-03-28 16:57:13:   2   BMW_320i                         :   5  
##  2016-03-05 14:09:18:   1   Volkswagen_Passat_Variant_1.9_TDI:   4  
##  2016-03-05 14:14:33:   1   BMW_116i                         :   3  
##  2016-03-05 14:17:34:   1   BMW_318i                         :   3  
##  2016-03-05 14:24:20:   1   Citroën_C1_1.0_Advance           :   3  
##  (Other)            :1992   (Other)                          :1977  
##         seller       offerType        price            abtest    
##  gewerblich:   0   Angebot:2000   Min.   :     0   control: 968  
##  privat    :2000   Gesuch :   0   1st Qu.:  1150   test   :1032  
##                                   Median :  2950                 
##                                   Mean   :  6042                 
##                                   3rd Qu.:  7500                 
##                                   Max.   :123456                 
##                                                                  
##      vehicleType  yearOfRegistration      gearbox        powerPS    
##  limousine :490   Min.   :1934       automatic: 425   Min.   :   0  
##  kleinwagen:412   1st Qu.:1999       manual   :1447   1st Qu.:  69  
##  kombi     :374   Median :2003       unknown  : 128   Median : 105  
##  unknown   :220   Mean   :2013                        Mean   : 117  
##  bus       :161   3rd Qu.:2009                        3rd Qu.: 150  
##  cabrio    :129   Max.   :8888                        Max.   :2331  
##  (Other)   :214                                                     
##      model        kilometer      monthOfRegistration    fuelType   
##  golf   : 158   Min.   :  5000   Min.   : 0.000      andere :   2  
##  andere : 150   1st Qu.:100000   1st Qu.: 3.000      benzin :1212  
##  unknown: 126   Median :150000   Median : 6.000      cng    :   4  
##  3er    : 125   Mean   :124370   Mean   : 5.671      diesel : 541  
##  polo   :  71   3rd Qu.:150000   3rd Qu.: 9.000      hybrid :   3  
##  a4     :  62   Max.   :150000   Max.   :12.000      lpg    :  30  
##  (Other):1308                                        unknown: 208  
##            brand     notRepairedDamage              dateCreated  
##  volkswagen   :436   no     :1454      2016-04-02 00:00:00:  84  
##  bmw          :235   unknown: 365      2016-03-20 00:00:00:  80  
##  opel         :203   yes    : 181      2016-03-14 00:00:00:  78  
##  audi         :192                     2016-03-19 00:00:00:  78  
##  mercedes_benz:188                     2016-03-07 00:00:00:  76  
##  ford         :127                     2016-04-03 00:00:00:  76  
##  (Other)      :619                     (Other)            :1528  
##   nrOfPictures   postalCode                   lastSeen   
##  Min.   :0     Min.   : 1069   2016-03-12 21:44:42:   2  
##  1st Qu.:0     1st Qu.:28326   2016-03-17 14:45:58:   2  
##  Median :0     Median :49514   2016-04-03 16:16:09:   2  
##  Mean   :0     Mean   :50491   2016-04-05 13:17:05:   2  
##  3rd Qu.:0     3rd Qu.:71588   2016-04-05 14:45:46:   2  
##  Max.   :0     Max.   :99974   2016-04-05 15:16:27:   2  
##                                (Other)            :1988

It’s now easier to see which factor labels were missing.

Based on our knowledge of postal codes and months, postalCode and monthOfRegistration should be treated as factors or whole numbers, not continuous values; the summary output treats them as continuous because they show medians and means instead of levels. In Germany, some postal codes start with 0, and the leading zero is dropped when the code is read as a number, which is why some values appear to have only 4 digits. We looked at a map of postal code districts in Germany (picture here: https://en.wikipedia.org/wiki/List_of_postal_codes_in_Germany) and decided to make the postalCode variable more manageable by creating categories based on the first digit of the postal code.

# The first digit of a 5-digit postal code is its value integer-divided by 10000
train$postalCodeCluster = as.factor(train$postalCode %/% 10000)
plot(train$postalCodeCluster, log(train$price), xlab="First digit of postal code", ylab="Log-Price", main="Figure 1. Log Price by Postal Code district")

We used log-price in order to see the changes in means more clearly. The average log-price doesn’t change much at all across postal code categories (figure 1). This could be a result of our choice of grouping, but it seems like postalCode doesn’t have much of an impact on log-price, so we can most likely leave it out of our model.

The monthOfRegistration doesn’t show any patterns of seasonality with respect to log-price, so we can probably ignore that variable, too (figure 2).

We could remove the nrOfPictures variable outright: none of the 2000 training entries has pictures, so we can’t train on it even if test-set listings do have pictures on the webpage. The description file also mentions that this variable may be a bug in the webcrawler, so we should leave it out of the analysis. Similarly, every posting in this training set is from a private seller (privat) rather than a commercial/dealer seller (gewerblich), and every listing has an offerType of “offer” (Angebot) rather than “request” (Gesuch), so these variables won’t help predict the price of a commercial-seller or request listing.
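Constant columns like nrOfPictures, seller, and offerType can also be found programmatically instead of by eye; a sketch on a toy data frame:

```r
# Toy data with two constant columns and one informative one.
toy <- data.frame(
  nrOfPictures = c(0, 0, 0),
  seller       = c("privat", "privat", "privat"),
  price        = c(500, 1200, 900)
)

# Names of columns that take a single value and therefore carry no information.
constant_cols <- names(Filter(function(x) length(unique(x)) <= 1, toy))
constant_cols
```

Running the same `Filter` over the real training set would flag nrOfPictures, seller, and offerType as droppable.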

The description file doesn’t explain what abtest refers to, but it most likely marks an A/B test of the website or ad design (test vs. control) rather than anything about the car itself; it could also mark ads for the same car shown with different prices or design techniques. Either way, it seems unlikely to be linked to the actual value of the car.

Figure 3 shows many outliers in both the control and test levels on the original price scale, but figure 4 shows that the two levels are nearly identical in terms of log-price, so this variable is unlikely to be useful.
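The similarity seen in figure 4 can also be checked numerically by comparing mean log-price per abtest level; a sketch on toy data (the real comparison would use `train$price` and `train$abtest`):

```r
# Toy prices split across the two abtest levels.
price  <- c(1000, 2000, 1500, 1800, 900, 2100)
abtest <- factor(c("control", "test", "control", "test", "control", "test"))

# Mean log-price per level; near-identical means would suggest abtest is uninformative.
round(tapply(log(price), abtest, mean), 2)
```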

The continuous variables in this dataset are price, yearOfRegistration, powerPS, and kilometer; aside from the dates, which may deserve separate treatment, the remaining variables should be treated as factors. Figure 5 shows a pairs plot of the continuous variables.

There appear to be several outliers in yearOfRegistration and potentially in price itself. Most entries have a low price, but the sparse entries with high prices make the plot difficult to read.

# The number of records with powerPS > 1000
sum(train$powerPS > 1000)
## [1] 5
# The number of records with yearOfRegistration > 2018
sum(train$yearOfRegistration > 2018)
## [1] 3

Very few vehicles have a powerPS listed above 1000, and even fewer have infeasible yearOfRegistration values. If we remove these values, the pairs plots may look cleaner.

# Removing these entries and viewing the pairs plot against log-price may make things clearer
# Columns 5, 8, 10, 12 are price, yearOfRegistration, powerPS, and kilometer
tr_dat = train[which(train$powerPS < 1000), c(5, 8, 10, 12)]
tr_dat = tr_dat[which(tr_dat$yearOfRegistration <= 2018), ]
tr_dat = data.frame(log(tr_dat$price), tr_dat)
pairs(tr_dat, main="Figure 6. Pairs Plot with Log-Price (Some Outliers Removed)")

Now powerPS looks fairly linear against log-price, though with a slight curve, and there are a few points along log-price = 0, which correspond to the 1-Euro listings (figure 6). We might consider taking the log of powerPS to smooth out the curvature.

plot(log(tr_dat$powerPS), tr_dat$log.tr_dat.price., xlab="LogPowerPS", ylab="LogPrice", main="Figure 7. Log PowerPS vs. Log-Price")

As figure 7 shows, taking the logarithm of powerPS yields a more linear relationship with log-price, though a few outliers remain to consider. Over 200 entries have a powerPS of 0, which is a problem because log(0) is -Inf.
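The log-of-zero problem is concrete: in R, log(0) returns -Inf, which silently poisons means and model fits. A sketch of the issue and two common remedies (filtering the zeros, or a log1p shift; both are our own suggestions, not steps the analysis above takes):

```r
powerPS <- c(0, 75, 150, 0, 101)

# Taking the log directly yields -Inf for the zero entries.
raw_logs <- log(powerPS)
any(is.infinite(raw_logs))  # TRUE

# Option 1: drop the zero entries before transforming.
filtered <- log(powerPS[powerPS > 0])

# Option 2: log1p shifts by 1, so zeros map to 0 instead of -Inf.
shifted <- log1p(powerPS)
```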

monthOfRegistration doesn’t look too interesting according to figure 2, but yearOfRegistration could be hiding a relationship. Kilometer also could be relevant, but since we’re working with thousands of kilometers, it may be interesting to take the log of kilometers.

Figure 8 shows log-price against yearOfRegistration, log-powerPS, and log-kilometer with the outliers and too-low prices removed. There is a strange pattern between log-price and yearOfRegistration: older cars carry a high price, the price gradually declines until about 1990, where it bottoms out at the lowest value we allow and more cars appear on the market, and the relationship then turns positive, with price increasing with year. Within a few years of 2018, however, prices start spanning the entire range, leaving a triangular empty region in the data. This shape may reflect the shift from currently available cars to future models, since the data was collected in 2016, near the point of the triangle. We also have to consider that fewer of the cars are old cars, so we can probably consider powerPS to reflect part of the age of the car given regular use. The log-powerPS variable, as noted earlier, looks fairly linear, although a few outliers remain. Taking the log of kilometer only emphasized the discreteness of the original variable, since the kilometers traveled were most likely rounded to the nearest 1000 km; log-kilometer looks less helpful than the raw kilometers, so this transformation probably shouldn’t persist into the model.

Several potential predictors may have too many levels to work with: dateCrawled, name, vehicleType, model, brand, dateCreated, and lastSeen.

# Note: these level counts come from the full scraped dataset's factor levels,
# so they can exceed the 2000 rows in our training sample
nlevels(train$dateCrawled) # 164590 levels
nlevels(train$model)       # 251 levels
nlevels(train$name)        # 128113 levels
nlevels(train$vehicleType) # 9 levels
nlevels(train$brand)       # 40 levels
nlevels(train$dateCreated) # 97 levels
nlevels(train$lastSeen)    # 111190 levels

The vehicleType variable has a manageable number of levels (9), and brand, at 40, may be workable, but the others are far too granular. The name field is written by sellers and usually contains the make, model, and important specifications, so nearly every name is unique. The dates (dateCrawled, dateCreated, and lastSeen) could be condensed to the month or day level to bring them under 100 levels. Model is usually important in the car-buying process, but 251 levels may be too many to handle; figure 11 shows that mean log-price differs by model, but the differences are very difficult to see.

Figures 9 and 10 show differences in mean log-price across the levels of vehicle type and brand, respectively. At 40 levels, it is already difficult to distinguish all of the brand names, so factors with even more levels are harder still to read, as figure 11 (log-price by model) shows.

In order to reduce the number of levels in factors like model and brand, we tried grouping the low-frequency levels together. We wanted to balance the number of levels (which ultimately will be turned into dummy variables) with the frequency of the levels.

#str(train)
brand_counts <- as.data.frame(sort(table(train$brand), decreasing = TRUE))
brand_counts$percentage <- (brand_counts$Freq/sum(brand_counts$Freq))*100
brand_counts$cum <- cumsum(brand_counts$percentage)
#View(brand_counts)

top_brands = brand_counts[1:10, 1]
brand_groups = as.factor(ifelse(as.character(train$brand) %in% as.character(top_brands), as.character(train$brand), "other"))
train = data.frame(train, brand_groups)
plot(train$brand_groups, log(train$price), xlab="Brand(Groups)", ylab="Log-Price", main="Figure 12. Log-Price by Brand Groups")

We chose to keep the top 10 brands as distinct levels because, together, they made up 80% of the data. This left 20% to the “other” category. It’s easier to see the different levels now (figure 12), and we can see differences in average price across the brand levels.

model_counts <- as.data.frame(sort(table(train$model), decreasing = TRUE))
model_counts$percentage <- (model_counts$Freq/sum(model_counts$Freq))*100
model_counts$cum <- cumsum(model_counts$percentage)
#View(model_counts)

top_models = model_counts[1:24, 1]
model_groups = as.factor(ifelse(as.character(train$model) %in% as.character(top_models), as.character(train$model), "other"))
train = data.frame(train, model_groups)
par(mfrow=c(1,1))
plot(train$model_groups, log(train$price), xlab="Model (Group)", ylab="Log-Price", main="Figure 13. Log-Price by Model Group")

The models were more difficult to narrow down: the top 12 levels only make up ~50% of the data, so we needed more levels to achieve reasonable coverage. We ended up keeping the top 24 models, each of which had at least 20 entries, covering about 67% of the data. This reduction in factor levels makes it much easier to see the differences in average log-price among models (figure 13).

Other categorical variables include gearbox, fuelType, and notRepairedDamage. Figures 14 through 16 show these variables against the log-price.

Figure 14 shows that the mean log-price for automatic gearboxes is higher than for manual or unknown gearboxes, and manual gearboxes are slightly more expensive than unknown on average. Log-price is higher on average for vehicles without unrepaired damage than for cars that still have it, with the unknown level falling between the two (figure 15). Figure 16 shows that some fuel type levels are far better represented than others, yet average log-price still differs across them. The andere level has only a couple of instances, while benzin and unknown span a wide range with similar average log-prices. The cng and hybrid levels have higher average log-prices than the others, followed by diesel, then lpg, benzin, unknown, and andere. Since there are so few instances of cng, andere, hybrid, and lpg compared to the other levels, we could group andere, cng, hybrid, lpg, and unknown together, leaving 3 levels.

fuelType_group = as.factor(ifelse(train$fuelType %in% c("benzin", "diesel"),
                                  as.character(train$fuelType), "other"))
train = data.frame(train, fuelType_group)
plot(train$fuelType_group, log(train$price), xlab="FuelType Group", ylab="Log-Price", main="Figure 17. Log-Price by FuelType Group")

The average log-price is clearly different across the 3 levels of the grouped fuelType variable (figure 17). This should help reduce the number of dummy variables in the models later.

There could be interactions in the data between the various factors and continuous variables, so we explored different combinations. Figures 18 through 23 show how kilometer looks against log-price when colored by the different levels of factor variables (fuelType_group, brand_groups, model_groups, gearbox, notRepairedDamage, and vehicleType, respectively). Figures 24 through 29 explore the relationship between those same factor variables and log-powerPS.

# Interactions with Kilometer
par(mfrow=c(1,2))
plot(train$kilometer, log(train$price), col=train$fuelType_group, xlab="Kilometer", ylab="Log-Price", main="Figure 18. Kilometer & FuelType Group")
plot(train$kilometer, log(train$price), col=train$brand_groups, xlab="Kilometer", ylab="Log-Price", main="Figure 19. Kilometer & Brand Group")

plot(train$kilometer, log(train$price), col=train$model_groups, xlab="Kilometer", ylab="Log-Price", main="Figure 20. Kilometer & Model Group")
plot(train$kilometer, log(train$price), col=train$gearbox, xlab="Kilometer", ylab="Log-Price", main="Figure 21. Kilometer & Gearbox")

plot(train$kilometer, log(train$price), col=train$notRepairedDamage, xlab="Kilometer", ylab="Log-Price", main="Figure 22. Kilometer & NotRepairedDamage")
plot(train$kilometer, log(train$price), col=train$vehicleType, xlab="Kilometer", ylab="Log-Price", main="Figure 23. Kilometer & VehicleType")

# Interactions with powerPS
plot(log(train$powerPS), log(train$price), col=train$fuelType_group, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 24. Log-PowerPS & FuelType Group")
plot(log(train$powerPS), log(train$price), col=train$brand_groups, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 25. Log-PowerPS & Brand Group")

plot(log(train$powerPS), log(train$price), col=train$model_groups, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 26. Log-PowerPS & Model Group")
plot(log(train$powerPS), log(train$price), col=train$gearbox, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 27. Log-PowerPS & Gearbox")

plot(log(train$powerPS), log(train$price), col=train$notRepairedDamage, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 28. Log-PowerPS & NotRepairedDamage")
plot(log(train$powerPS), log(train$price), col=train$vehicleType, xlab="Log-PowerPS", ylab="Log-Price", main="Figure 29. Log-PowerPS & VehicleType")

Visually, it’s hard to discern whether any factor variable interacts with kilometer or log-powerPS; no clear distinctions stand out, but two-dimensional scatterplots make such effects difficult to see.
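Beyond eyeballing colored scatterplots, an interaction can be probed numerically with a small linear model containing a product term; a sketch on simulated data (this model is illustrative, not the one fit later):

```r
set.seed(1)
n <- 200
kilometer <- runif(n, 5000, 150000)
gearbox   <- factor(sample(c("manual", "automatic"), n, replace = TRUE))

# Simulate a log-price whose kilometer slope differs by gearbox level.
slope    <- ifelse(gearbox == "automatic", -2e-5, -1e-5)
logPrice <- 9 + slope * kilometer + rnorm(n, sd = 0.2)

# The kilometer:gearbox coefficient estimates the slope difference between levels;
# a clearly nonzero estimate indicates an interaction.
fit <- lm(logPrice ~ kilometer * gearbox)
names(coef(fit))
```

On the real data, the same `kilometer * gearbox` formula (or `summary(fit)` p-values) would quantify what the colored plots leave ambiguous.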

Transformations

We should remove the low-ball prices because they don’t reflect the selling price of a car. The German eBay page this data was scraped from also allows sellers to list cars as “To give away” (free), which might mean those cars are in poor shape, judging by the pictures on the current page. This distinction between selling tactic and car quality might be resolved with a car-quality predictor, but that information isn’t available. As it stands, the price variable is strongly right-skewed (figure 30). Taking the logarithm of price yields a normal-looking distribution with a mean around 8, though a small bump remains close to 0 (figure 31).

Removing entries with prices less than 100 Euro eliminates the bump near 0 from figure 31, leaving a nice Gaussian-looking curve (figure 32).

Limiting prices to at least 100 Euro removes only 74 entries from our training set (about 3.7%), and this doesn’t change the overall summary of the training set too much.
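The construction of train2 isn’t shown above; based on the summary that follows, it would look something like this sketch (the column names logPrice and logPowerPS are taken from that summary; shown here on a toy data frame):

```r
# Toy stand-in for train with just the relevant columns.
train_toy <- data.frame(price   = c(0, 1, 150, 3000, 50),
                        powerPS = c(0, 75, 90, 150, 60))

# Keep listings priced at 100 Euro or more, then add the log-transformed columns.
train2_toy <- train_toy[train_toy$price >= 100, ]
train2_toy$logPrice   <- log(train2_toy$price)
train2_toy$logPowerPS <- log(train2_toy$powerPS)
nrow(train2_toy)
```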

nrow(train2) #1926 rows
## [1] 1926
summary(train2)
##               dateCrawled       price            vehicleType 
##  2016-03-23 17:46:12:   2   Min.   :   100   limousine :482  
##  2016-03-28 16:57:13:   2   1st Qu.:  1299   kleinwagen:402  
##  2016-03-05 14:09:18:   1   Median :  3100   kombi     :368  
##  2016-03-05 14:14:33:   1   Mean   :  6274   unknown   :183  
##  2016-03-05 14:17:34:   1   3rd Qu.:  7800   bus       :158  
##  2016-03-05 14:24:20:   1   Max.   :123456   cabrio    :129  
##  (Other)            :1918                    (Other)   :204  
##  yearOfRegistration      gearbox        powerPS          kilometer     
##  Min.   :1934       automatic: 417   Min.   :   0.00   Min.   :  5000  
##  1st Qu.:1999       manual   :1409   1st Qu.:  72.25   1st Qu.:100000  
##  Median :2003       unknown  : 100   Median : 109.00   Median :150000  
##  Mean   :2013                        Mean   : 118.13   Mean   :124958  
##  3rd Qu.:2009                        3rd Qu.: 150.00   3rd Qu.:150000  
##  Max.   :8888                        Max.   :2331.00   Max.   :150000  
##                                                                        
##  notRepairedDamage              dateCreated                  lastSeen   
##  no     :1436      2016-04-02 00:00:00:  79   2016-03-12 21:44:42:   2  
##  unknown: 327      2016-03-19 00:00:00:  77   2016-03-17 14:45:58:   2  
##  yes    : 163      2016-03-14 00:00:00:  73   2016-04-03 16:16:09:   2  
##                    2016-03-20 00:00:00:  73   2016-04-05 13:17:05:   2  
##                    2016-03-29 00:00:00:  73   2016-04-05 14:45:46:   2  
##                    2016-04-03 00:00:00:  73   2016-04-05 15:16:27:   2  
##                    (Other)            :1478   (Other)            :1914  
##         brand_groups  model_groups fuelType_group    logPrice     
##  volkswagen   :418   other  :673   benzin:1182    Min.   : 4.605  
##  other        :379   golf   :149   diesel: 537    1st Qu.: 7.169  
##  bmw          :224   andere :143   other : 207    Median : 8.039  
##  opel         :196   3er    :118                  Mean   : 8.038  
##  audi         :186   unknown:101                  3rd Qu.: 8.962  
##  mercedes_benz:183   polo   : 70                  Max.   :11.724  
##  (Other)      :340   (Other):672                                  
##    logPowerPS   
##  Min.   : -Inf  
##  1st Qu.:4.280  
##  Median :4.691  
##  Mean   : -Inf  
##  3rd Qu.:5.011  
##  Max.   :7.754  
## 

We should also consider cleaning up the date labels, at either the month or day level, to expose any hidden patterns such as seasonality. The description file indicates that dateCrawled is the date the ad was scraped, dateCreated is the date the ad was created, and lastSeen is the last date the page crawler saw the ad. The difference between dateCreated and lastSeen might be interesting as a proxy for how long it took to sell the car; however, ad duration is more likely a consequence of the price than a predictor of it.

par(mfrow=c(2,2))
dateCrawl_month = as.factor(format(as.Date(train2$dateCrawled), "%Y-%m"))
plot(dateCrawl_month, train2$logPrice, main="Figure 33. Month of crawl vs Log-Price")

dateCrawl_day = as.factor(as.Date(train2$dateCrawled, "%Y-%m-%d"))
plot(dateCrawl_day, train2$logPrice, main="Figure 34. Month/Day of crawl vs Log-Price")

adLength = as.double(difftime(as.Date(train2$lastSeen, "%Y-%m-%d"), as.Date(train2$dateCreated, "%Y-%m-%d"), units="days"))
#summary(adLength)
plot(adLength, train2$logPrice, main="Figure 35. Ad length vs. Log-Price", xlab="Number of days ad was online", ylab="Log-Price")

ad_data = data.frame(train2, adLength)
ad_data = ad_data[which(ad_data$adLength < 40), ]
plot(ad_data$adLength, ad_data$logPrice, main="Figure 36. Ad length (clean) vs. Log-Price"
     , xlab = "Number of days ad was online", ylab="Log-Price")

Aggregating the log-price into the month or day the ad was crawled doesn’t unveil any hidden patterns since the mean log-price appears very similar across months and days (figures 33 and 34, respectively). The length of time between when the ad was created and the last time the webcrawler saw the ad doesn’t seem to have any effect on log-price either, although it looks like most ads were taken down in the first 2 weeks to a month (figures 35 and 36). This leads us to believe that the date variables won’t be helpful in predicting price.

##      vehicleType       gearbox       kilometer      notRepairedDamage
##  limousine :453   automatic: 383   Min.   :  5000   no     :1355     
##  kleinwagen:354   manual   :1299   1st Qu.:100000   unknown: 224     
##  kombi     :336   unknown  :  39   Median :150000   yes    : 142     
##  bus       :148                    Mean   :124831                    
##  cabrio    :121                    3rd Qu.:150000                    
##  unknown   :119                    Max.   :150000                    
##  (Other)   :190                                                      
##         brand_groups  model_groups fuelType_group    logPrice     
##  volkswagen   :381   other  :742   benzin:1079    Min.   : 4.605  
##  other        :339   golf   :133   diesel: 499    1st Qu.: 7.279  
##  bmw          :206   3er    :107   other : 143    Median : 8.161  
##  opel         :174   unknown: 72                  Mean   : 8.136  
##  audi         :171   polo   : 62                  3rd Qu.: 9.036  
##  mercedes_benz:162   a4     : 55                  Max.   :11.653  
##  (Other)      :288   (Other):550                                  
##    logPowerPS   
##  Min.   :1.386  
##  1st Qu.:4.431  
##  Median :4.754  
##  Mean   :4.744  
##  3rd Qu.:5.011  
##  Max.   :6.252  
## 
## [1] 1721

After removing the outliers in powerPS and yearOfRegistration along with oddities such as the too-low prices and powerPS values of 0, we are left with 1721 of the original 2000 entries, summarized above.
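The filtering that produces these 1721 rows isn’t shown; a sketch of the conditions described, applied here to a toy data frame:

```r
# Toy rows, each violating one condition except the last.
toy <- data.frame(
  price              = c(50, 500, 3000, 1200, 800),
  powerPS            = c(75, 0, 1500, 90, 110),
  yearOfRegistration = c(2001, 1999, 2005, 8888, 2010)
)

# Drop sub-100-Euro prices, powerPS of 0 or above 1000, and impossible years.
clean <- toy[toy$price >= 100 &
             toy$powerPS > 0 & toy$powerPS < 1000 &
             toy$yearOfRegistration <= 2018, ]
nrow(clean)
```

The same boolean filter on the real train2 would yield the 1721-row set summarized above.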

After the exploration, we decided that the following will be our list of predictors, and our response will be log-price.

Predictors:

  • vehicleType

  • gearbox

  • logPowerPS

  • model_groups

  • kilometer

  • fuelType_group

  • brand_groups

  • notRepairedDamage

Test data transformations

For every transformation made to our training set, we should make the same transformations to the test set.

test$notRepairedDamage = as.factor(ifelse(test$notRepairedDamage == "ja", "yes", ifelse(test$notRepairedDamage == "nein", "no", "unknown")))
test$gearbox = as.factor(ifelse(test$gearbox == "automatik", "automatic", ifelse(test$gearbox == "manuell", "manual", "unknown")))

# Replace blank levels with "unknown" for several factor columns
blank_to_unknown = function(x) as.factor(ifelse(x == "", "unknown", as.character(x)))
test$vehicleType = blank_to_unknown(test$vehicleType)
test$fuelType = blank_to_unknown(test$fuelType)
test$model = blank_to_unknown(test$model)
test$brand = blank_to_unknown(test$brand)

#str(test)
brand_groups = as.factor(ifelse(as.character(test$brand) %in% as.character(top_brands), as.character(test$brand), "other"))
test = data.frame(test, brand_groups)

model_groups = as.factor(ifelse(as.character(test$model) %in% as.character(top_models), as.character(test$model), "other"))
test = data.frame(test, model_groups)

fuelType_group = as.factor(ifelse(as.character(test$fuelType) == "benzin", as.character(test$fuelType), ifelse(as.character(test$fuelType) == "diesel", as.character(test$fuelType), "other")))
test = data.frame(test, fuelType_group)

test2 = test[which(test$price >= 100),]
test2 = data.frame(test2[,c(5,7,8,9,10,12,16,21,22,23)],log(test2$price), log(test2$powerPS))
colnames(test2)[11] = "logPrice"
colnames(test2)[12] = "logPowerPS"

test2 = test2[which(test2$powerPS < 1000),]
test2 = test2[which(test2$yearOfRegistration <= 2018), -1] #remove price (keeping logprice)
test2 = test2[,-2] # remove yearOfRegistration

# Create data frame for final test dataset
test2.df <- data.frame(test2)

# Delete zeroes from powerPS
test2.df <- test2.df[test2.df$powerPS != 0,]

# Delete powerPS variable
test2.df <- test2.df[,-3]

# Combine "andere" and "other"
test2.df$model_groups[which(test2.df$model_groups == "andere")] <- "other" #combine variables since same meaning

#data cleaning complete
test_final <- test2.df

Model Building & Analysis

In order to explore a variety of models for this dataset, we built models with linear regression, ridge regression, Lasso, GAMs, and random forests. The exploratory analysis indicated that the transformed data might fit a linear regression well, so we started there.

Linear Model

# Fit linear model
train.lm <- lm(logPrice~., data=train_final)
summary(train.lm) # R^2 = 0.68
## 
## Call:
## lm(formula = logPrice ~ ., data = train_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8563 -0.4012  0.0500  0.4438  3.3824 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.440e+00  4.027e-01  11.027  < 2e-16 ***
## vehicleTypebus             6.068e-01  2.041e-01   2.973 0.002989 ** 
## vehicleTypecabrio          7.696e-01  2.027e-01   3.797 0.000152 ***
## vehicleTypecoupe           7.101e-01  2.054e-01   3.458 0.000558 ***
## vehicleTypekleinwagen      3.969e-01  1.980e-01   2.005 0.045124 *  
## vehicleTypekombi           2.852e-01  1.975e-01   1.444 0.148892    
## vehicleTypelimousine       3.847e-01  1.949e-01   1.974 0.048549 *  
## vehicleTypesuv             6.455e-01  2.093e-01   3.084 0.002078 ** 
## vehicleTypeunknown         3.069e-01  2.056e-01   1.493 0.135612    
## gearboxmanual             -1.374e-01  4.994e-02  -2.752 0.005993 ** 
## gearboxunknown            -1.993e-01  1.204e-01  -1.656 0.097926 .  
## kilometer                 -1.227e-05  4.509e-07 -27.224  < 2e-16 ***
## notRepairedDamageunknown  -4.322e-01  5.304e-02  -8.150 7.06e-16 ***
## notRepairedDamageyes      -9.038e-01  6.270e-02 -14.415  < 2e-16 ***
## brand_groupsbmw           -8.506e-02  1.565e-01  -0.543 0.586899    
## brand_groupsfiat          -1.712e-01  1.598e-01  -1.072 0.283960    
## brand_groupsford          -6.102e-01  1.359e-01  -4.488 7.67e-06 ***
## brand_groupsmercedes_benz -1.144e-02  1.295e-01  -0.088 0.929616    
## brand_groupsopel          -3.214e-01  1.378e-01  -2.332 0.019817 *  
## brand_groupsother         -1.704e-01  1.098e-01  -1.552 0.120861    
## brand_groupspeugeot       -2.409e-01  1.632e-01  -1.476 0.140142    
## brand_groupsrenault       -6.125e-01  1.443e-01  -4.244 2.31e-05 ***
## brand_groupssmart          5.161e-01  3.288e-01   1.569 0.116731    
## brand_groupsvolkswagen     7.386e-02  1.307e-01   0.565 0.571981    
## model_groups3_reihe       -4.815e-01  2.690e-01  -1.790 0.073578 .  
## model_groups3er           -4.183e-01  1.575e-01  -2.656 0.007987 ** 
## model_groups5er           -4.780e-01  1.809e-01  -2.643 0.008291 ** 
## model_groupsa_klasse      -5.886e-01  2.499e-01  -2.356 0.018598 *  
## model_groupsa3            -2.723e-01  2.430e-01  -1.121 0.262580    
## model_groupsa4            -5.241e-01  2.323e-01  -2.257 0.024166 *  
## model_groupsa6            -5.454e-01  2.446e-01  -2.230 0.025905 *  
## model_groupsastra         -2.637e-01  2.314e-01  -1.140 0.254467    
## model_groupsc_klasse      -6.488e-01  2.319e-01  -2.798 0.005206 ** 
## model_groupscorsa         -2.157e-01  2.334e-01  -0.924 0.355532    
## model_groupse_klasse      -5.666e-01  2.446e-01  -2.316 0.020666 *  
## model_groupsfiesta        -2.228e-01  2.588e-01  -0.861 0.389440    
## model_groupsfocus          1.086e-02  2.491e-01   0.044 0.965219    
## model_groupsfortwo        -8.471e-01  3.794e-01  -2.233 0.025710 *  
## model_groupsgolf          -5.378e-01  2.110e-01  -2.549 0.010905 *  
## model_groupsother         -3.709e-01  1.882e-01  -1.970 0.048952 *  
## model_groupspassat        -7.680e-01  2.269e-01  -3.384 0.000731 ***
## model_groupspolo          -3.090e-01  2.215e-01  -1.395 0.163265    
## model_groupstouran        -2.886e-01  2.544e-01  -1.134 0.256815    
## model_groupstransporter   -1.144e-01  2.436e-01  -0.470 0.638727    
## model_groupstwingo         1.791e-01  2.695e-01   0.665 0.506432    
## model_groupsunknown       -4.576e-01  1.987e-01  -2.303 0.021420 *  
## model_groupsvectra        -8.620e-01  2.512e-01  -3.431 0.000615 ***
## fuelType_groupdiesel       5.374e-01  4.273e-02  12.577  < 2e-16 ***
## fuelType_groupother        5.294e-02  6.881e-02   0.769 0.441745    
## logPowerPS                 1.138e+00  5.575e-02  20.413  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6854 on 1671 degrees of freedom
## Multiple R-squared:  0.684,  Adjusted R-squared:  0.6747 
## F-statistic:  73.8 on 49 and 1671 DF,  p-value: < 2.2e-16
# ANOVA
train.anova <- anova(train.lm)
train.anova
## Analysis of Variance Table
## 
## Response: logPrice
##                     Df Sum Sq Mean Sq   F value    Pr(>F)    
## vehicleType          8 413.55   51.69  110.0421 < 2.2e-16 ***
## gearbox              2 154.80   77.40  164.7620 < 2.2e-16 ***
## kilometer            1 495.12  495.12 1053.9680 < 2.2e-16 ***
## notRepairedDamage    2 183.61   91.81  195.4278 < 2.2e-16 ***
## brand_groups        10 131.56   13.16   28.0042 < 2.2e-16 ***
## model_groups        23  36.17    1.57    3.3473 1.534e-07 ***
## fuelType_group       2  88.26   44.13   93.9397 < 2.2e-16 ***
## logPowerPS           1 195.75  195.75  416.7022 < 2.2e-16 ***
## Residuals         1671 784.98    0.47                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# AIC and BIC
AIC(train.lm) # 3635
## [1] 3635.005
BIC(train.lm) # 3913
## [1] 3912.989
# MSE
pd <- predict(train.lm, test2.df)
test.mse <- mean((test_final$logPrice - pd)^2)
test.mse # MSE = 0.66
## [1] 0.6629623

Residual Plots for Linear Model:

# Residual plots
resf = rstandard(train.lm)
plot(resf, main="Plot of Residuals")

plot(predict(train.lm), resf, xlab="Predicted Values from Linear Model", main="Predicted vs. Residuals")

plot(train_final$kilometer, resf, xlab="Kilometer", main="Kilometer vs. Residuals")

plot(train_final$logPowerPS, resf, xlab="Log-PowerPS", main="Log-PowerPS vs. Residuals")

# Residual boxplots for categorical variables
par(mfrow=c(2,2))
with(train_final,{
  plot(resf~gearbox)
  plot(resf~vehicleType)
  plot(resf~brand_groups)
  plot(resf~model_groups)
})

# qq plot:
par(mfrow=c(1,1))
qqPlot(resf)

Let \(\{Y_i : i = 1,...,n\}\) denote a set of independent random variables, where \(Y_i\) is the log-price of the \(i\)th vehicle in our sample. We assume \(Y_i \sim N(\mu_i,\sigma^2)\). Our linear model for \(Y_i\), given the predictor values \(x_i\), is

\(Y_i=\beta_0+\sum_{j=1}^{p}\beta_jX_{ij}+\epsilon_i\)

where the categorical predictors enter through dummy variables, so \(p\) counts the resulting dummy columns (49 in the fit above) rather than the eight raw predictors.

The multiple \(R^2\) is 0.68 (adjusted 0.67), so the model explains roughly two thirds of the variance in log-price. Evaluated against the held-out test set, the model gives an MSE of 0.66, which is reasonable but leaves room for improvement. The AIC is 3635 and the BIC is 3913; these scores will be useful once we compare the linear model to additional models.

The residuals appear symmetrically distributed for the linear model, clustering toward the middle of the plot. However, the residual plots reveal a few outliers that could hurt the model's predictive performance. For example, the model appears to overestimate prices at very low kilometer values. Some of the outliers seem to come from entries that may have been recorded incorrectly, since they tend to have zeroes or implausibly low values in all of the continuous variables. Because we cannot be certain these points are wrong, and removing them could increase the model's overall variance, we chose to keep them for now.

The residuals show a noticeable departure from normality in the tails but look randomly scattered among the central values, so the QQ plot challenges the normality assumption to some degree. This could be an issue with the data, since there may be considerable noise in the regions where the residuals depart from the reference line. We will continue to improve the linear model by introducing some degree of smoothing, and we will investigate additional models to improve predictive accuracy.
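One way to supplement the visual QQ-plot check is a formal normality test on the standardized residuals. A sketch on a small simulated fit (a stand-in, since reproducing train_final here would be lengthy), using base R's shapiro.test:

```r
# Simulated stand-in for our fitted model: normal errors by construction.
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 * toy$x + rnorm(200)
fit <- lm(y ~ x, data = toy)

# Shapiro-Wilk on the standardized residuals; a small p-value is evidence
# against normality. The test accepts up to 5000 observations, so our
# 1721 training residuals would qualify.
sw <- shapiro.test(rstandard(fit))
sw$statistic  # W close to 1 indicates near-normal residuals
```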

Ridge Regression

#get model matrices set y's
x.train = model.matrix(logPrice~.,train_final)[,-1] 
y.train = train_final$logPrice
x.test = model.matrix(logPrice~.,test_final)[,-1] 
y.test = test_final$logPrice

#observe MSE for many values of lambda
#set seed to replicate 
set.seed(3) 
lambdas=seq(1e-3,300,length=100)
MSE=rep(0,100)
for(i in 1:100)
{
fit.ridge=glmnet(x.train,y.train,alpha=0,lambda=lambdas[i])
pred.ridge=predict(fit.ridge,newx=x.test)
MSE[i]=mean((pred.ridge-y.test)^2)
}
plot(lambdas,MSE,xlab=expression(lambda),ylab="Test set MSE")

#cross validation
ridge.cv = cv.glmnet(x.train,y.train,alpha=0)
plot(ridge.cv) #plot on log-lambda scale

plot(ridge.cv$lambda,ridge.cv$cvm,xlim=c(0,300)) #plot on lambda scale

ridge.lambda.cv = ridge.cv$lambda.min # the minimizing lambda

#fit model 
fit.ridge = glmnet(x.train,y.train,alpha=0,lambda=ridge.lambda.cv)
pred.ridge = predict(fit.ridge,newx=x.test)

After an initial linear regression met with some success, we applied some of our new tools: ridge and lasso regression. The data first had to be converted into model matrices, as the glmnet package requires. To begin the ridge analysis, we examined the test-set MSE across 100 values of lambda; notice that the MSE is smaller for smaller lambdas. To find the optimal lambda, we ran cross-validation. The package plots on the log-lambda scale, but re-plotting on the lambda scale shows that the mean cross-validated error follows a trend similar to the test-set MSEs we plotted.

Lasso

#observe MSE for many values of lambda
#set seed to replicate 
set.seed(3) 
lambdas=seq(1e-3,10,length=100)
MSE=rep(0,100)
for(i in 1:100)
{
fit.lasso=glmnet(x.train,y.train,alpha=1,lambda=lambdas[i])
pred.lasso=predict(fit.lasso,newx=x.test)
MSE[i]=mean((pred.lasso-y.test)^2)
}
plot(lambdas,MSE,xlab=expression(lambda),ylab="Test set MSE")

#cross validation
lasso.cv = cv.glmnet(x.train,y.train,alpha=1)
plot(lasso.cv) #plot on log-lambda scale

plot(lasso.cv$lambda,lasso.cv$cvm,xlim=c(0,.7)) #plot on lambda scale 

lasso.lambda.cv = lasso.cv$lambda.min # the minimizing lambda

#fit model 
fit.lasso = glmnet(x.train,y.train,alpha=1,lambda=lasso.lambda.cv)
pred.lasso = predict(fit.lasso,newx=x.test)

We start the lasso regression by exploring the test-set MSE over a range of lambdas, just as with ridge. Notice that the MSE flattens out almost immediately, so we can expect a small optimal lambda. We again run cross-validation before fitting the final model.

Compare Ridge and Lasso

#penalties
sqrt(sum(coef(fit.ridge)[-1,1]^2))
## [1] 2.140395
sum(abs(coef(fit.lasso)[-1,1]))
## [1] 7.951441
#cross validated optimal lambdas
ridge.lambda.cv
## [1] 0.07542384
lasso.lambda.cv
## [1] 0.009517409
#MSE's
mean((y.test-pred.ridge)^2) 
## [1] 0.6617404
mean((y.test-pred.lasso)^2) 
## [1] 0.667504
#Compute R^2 from true and predicted values
rsquare <- function(true, predicted) {
  sse <- sum((predicted - true)^2)
  sst <- sum((true - mean(true))^2)
  rsq <- 1 - sse / sst
  
  return (rsq)
}
rsquare(y.test, pred.ridge)
## [1] 0.5649007
rsquare(y.test, pred.lasso)
## [1] 0.5611111
#summary of coefficients
coef(fit.ridge)
## 51 x 1 sparse Matrix of class "dgCMatrix"
##                                      s0
## (Intercept)                4.911977e+00
## vehicleTypebus             2.063818e-01
## vehicleTypecabrio          3.925963e-01
## vehicleTypecoupe           3.378565e-01
## vehicleTypekleinwagen     -2.893936e-02
## vehicleTypekombi          -9.898315e-02
## vehicleTypelimousine       1.013587e-03
## vehicleTypesuv             2.659037e-01
## vehicleTypeunknown        -9.924979e-02
## gearboxmanual             -1.704349e-01
## gearboxunknown            -2.210354e-01
## kilometer                 -1.173835e-05
## notRepairedDamageunknown  -4.210230e-01
## notRepairedDamageyes      -8.701989e-01
## brand_groupsbmw            1.186835e-01
## brand_groupsfiat          -1.143002e-01
## brand_groupsford          -4.897555e-01
## brand_groupsmercedes_benz  4.669998e-02
## brand_groupsopel          -2.269095e-01
## brand_groupsother         -9.241137e-02
## brand_groupspeugeot       -1.618832e-01
## brand_groupsrenault       -5.183964e-01
## brand_groupssmart          2.888474e-01
## brand_groupsvolkswagen     9.065801e-02
## model_groups3_reihe       -1.452779e-01
## model_groups3er           -1.864435e-01
## model_groups5er           -2.104694e-01
## model_groupsa_klasse      -2.307402e-01
## model_groupsa3             1.569084e-01
## model_groupsa4            -8.284133e-02
## model_groupsa6            -7.873802e-02
## model_groupsandere         .           
## model_groupsastra          3.449666e-02
## model_groupsc_klasse      -2.557823e-01
## model_groupscorsa          6.848212e-02
## model_groupse_klasse      -1.651786e-01
## model_groupsfiesta         4.378621e-02
## model_groupsfocus          2.853915e-01
## model_groupsfortwo        -2.460294e-01
## model_groupsgolf          -1.328169e-01
## model_groupsother         -1.823970e-02
## model_groupspassat        -3.306616e-01
## model_groupspolo           4.791922e-02
## model_groupstouran         1.441807e-01
## model_groupstransporter    2.937679e-01
## model_groupstwingo         4.339889e-01
## model_groupsunknown       -1.304828e-01
## model_groupsvectra        -5.225440e-01
## fuelType_groupdiesel       4.992211e-01
## fuelType_groupother        3.271697e-02
## logPowerPS                 1.025680e+00
coef(fit.lasso)
## 51 x 1 sparse Matrix of class "dgCMatrix"
##                                      s0
## (Intercept)                4.360680e+00
## vehicleTypebus             1.809538e-01
## vehicleTypecabrio          3.631925e-01
## vehicleTypecoupe           2.967151e-01
## vehicleTypekleinwagen      .           
## vehicleTypekombi          -1.155057e-01
## vehicleTypelimousine       .           
## vehicleTypesuv             2.237381e-01
## vehicleTypeunknown        -5.801819e-02
## gearboxmanual             -1.215653e-01
## gearboxunknown            -1.358751e-01
## kilometer                 -1.228283e-05
## notRepairedDamageunknown  -4.147671e-01
## notRepairedDamageyes      -8.858663e-01
## brand_groupsbmw            6.532282e-03
## brand_groupsfiat           .           
## brand_groupsford          -3.829004e-01
## brand_groupsmercedes_benz  .           
## brand_groupsopel          -1.079494e-01
## brand_groupsother         -3.248221e-02
## brand_groupspeugeot       -7.186311e-02
## brand_groupsrenault       -3.992228e-01
## brand_groupssmart          1.637430e-01
## brand_groupsvolkswagen     1.006120e-01
## model_groups3_reihe       -5.040341e-02
## model_groups3er            .           
## model_groups5er           -4.573616e-03
## model_groupsa_klasse      -3.688640e-02
## model_groupsa3             1.409613e-01
## model_groupsa4             .           
## model_groupsa6             .           
## model_groupsandere         .           
## model_groupsastra          .           
## model_groupsc_klasse      -9.669535e-02
## model_groupscorsa          .           
## model_groupse_klasse       .           
## model_groupsfiesta         .           
## model_groupsfocus          1.977004e-01
## model_groupsfortwo         .           
## model_groupsgolf          -5.066263e-02
## model_groupsother          .           
## model_groupspassat        -2.390040e-01
## model_groupspolo           9.323364e-02
## model_groupstouran         1.235251e-01
## model_groupstransporter    3.233691e-01
## model_groupstwingo         3.475467e-01
## model_groupsunknown       -3.536588e-02
## model_groupsvectra        -5.025989e-01
## fuelType_groupdiesel       5.145002e-01
## fuelType_groupother        .           
## logPowerPS                 1.132900e+00

Next we compare the ridge and lasso results. The “sizes” of the ridge and lasso coefficient vectors at the optimal lambdas are 2.14 and 7.95, respectively. The cross-validated lambdas are quite small, 0.075 for ridge and only 0.0095 for the lasso, so both models apply only a mild penalty. The MSEs of the two models are nearly identical at 0.662 and 0.668, as are the \(R^2\) values of 0.565 and 0.561. With such small differences in MSE and \(R^2\), we prefer the lasso, which reduces dimensionality: as the coefficient output shows, several of the vehicle types, brands, models, and the “other” fuel group are shrunk exactly to zero.
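The “size” figures quoted above are simply norms of the fitted coefficient vectors with the intercept excluded: the L2 norm for ridge and the L1 norm for the lasso. A minimal illustration on a hypothetical coefficient vector:

```r
# Hypothetical stand-in for coef(fit)[-1, 1]; the zero mimics a
# coefficient the lasso shrank away entirely.
beta <- c(0.5, -0.2, 0, 1.1)

l2_size <- sqrt(sum(beta^2))  # the quantity reported for ridge
l1_size <- sum(abs(beta))     # the quantity reported for the lasso
n_zero  <- sum(beta == 0)     # how many terms the lasso dropped

c(l2 = l2_size, l1 = l1_size, zeros = n_zero)
```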

Generalized Additive Model

# Find k for continous
train.gam <- gam(logPrice~s(kilometer, k =-1, bs = "cs"), data=train_final)
summary(train.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## logPrice ~ s(kilometer, k = -1, bs = "cs")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.13608    0.02533   321.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(kilometer) 4.736      9 59.03  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.235   Deviance explained = 23.8%
## GCV = 1.1078  Scale est. = 1.1041    n = 1721
#gam.check(train.gam) # k = 9

# Find k for continous
train.gam <- gam(logPrice~s(logPowerPS, k =-1, bs = "cs"), data=train_final)
summary(train.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## logPrice ~ s(logPowerPS, k = -1, bs = "cs")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.13608    0.02354   345.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df    F p-value    
## s(logPowerPS) 5.803      9 98.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.339   Deviance explained = 34.2%
## GCV = 0.95762  Scale est. = 0.95383   n = 1721
#gam.check(train.gam) # k = 9

# Calculate Gams
train.gam <- gam(logPrice~s(kilometer,k=9)+s(logPowerPS,k=9)+vehicleType+gearbox+notRepairedDamage+brand_groups+model_groups+fuelType_group, data=train_final)
summary(train.gam) # R^2 = 69%
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## logPrice ~ s(kilometer, k = 9) + s(logPowerPS, k = 9) + vehicleType + 
##     gearbox + notRepairedDamage + brand_groups + model_groups + 
##     fuelType_group
## 
## Parametric coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8.298756   0.279621  29.679  < 2e-16 ***
## vehicleTypebus             0.667964   0.198774   3.360 0.000796 ***
## vehicleTypecabrio          0.750541   0.197554   3.799 0.000150 ***
## vehicleTypecoupe           0.634858   0.200965   3.159 0.001611 ** 
## vehicleTypekleinwagen      0.402587   0.193844   2.077 0.037968 *  
## vehicleTypekombi           0.321322   0.192348   1.671 0.095004 .  
## vehicleTypelimousine       0.400917   0.189819   2.112 0.034827 *  
## vehicleTypesuv             0.675964   0.204175   3.311 0.000951 ***
## vehicleTypeunknown         0.323178   0.200116   1.615 0.106511    
## gearboxmanual             -0.069714   0.050371  -1.384 0.166541    
## gearboxunknown            -0.192274   0.117562  -1.636 0.102132    
## notRepairedDamageunknown  -0.395247   0.052041  -7.595 5.11e-14 ***
## notRepairedDamageyes      -0.853692   0.061315 -13.923  < 2e-16 ***
## brand_groupsbmw           -0.121274   0.153022  -0.793 0.428165    
## brand_groupsfiat          -0.168576   0.156392  -1.078 0.281234    
## brand_groupsford          -0.589671   0.132994  -4.434 9.87e-06 ***
## brand_groupsmercedes_benz  0.005041   0.126681   0.040 0.968261    
## brand_groupsopel          -0.337916   0.134905  -2.505 0.012346 *  
## brand_groupsother         -0.168685   0.107907  -1.563 0.118186    
## brand_groupspeugeot       -0.212632   0.160267  -1.327 0.184780    
## brand_groupsrenault       -0.582105   0.140966  -4.129 3.82e-05 ***
## brand_groupssmart          0.560272   0.321027   1.745 0.081127 .  
## brand_groupsvolkswagen     0.133785   0.128106   1.044 0.296487    
## model_groups3_reihe       -0.534755   0.263336  -2.031 0.042445 *  
## model_groups3er           -0.435456   0.154022  -2.827 0.004752 ** 
## model_groups5er           -0.535102   0.177653  -3.012 0.002634 ** 
## model_groupsa_klasse      -0.679297   0.245246  -2.770 0.005671 ** 
## model_groupsa3            -0.322838   0.236971  -1.362 0.173270    
## model_groupsa4            -0.567370   0.226181  -2.508 0.012220 *  
## model_groupsa6            -0.674272   0.238963  -2.822 0.004834 ** 
## model_groupsastra         -0.302418   0.227376  -1.330 0.183690    
## model_groupsc_klasse      -0.682026   0.225748  -3.021 0.002556 ** 
## model_groupscorsa         -0.281141   0.230210  -1.221 0.222170    
## model_groupse_klasse      -0.654285   0.238555  -2.743 0.006159 ** 
## model_groupsfiesta        -0.339283   0.255030  -1.330 0.183582    
## model_groupsfocus         -0.073343   0.243749  -0.301 0.763533    
## model_groupsfortwo        -0.924688   0.370651  -2.495 0.012701 *  
## model_groupsgolf          -0.651654   0.207917  -3.134 0.001753 ** 
## model_groupsother         -0.461698   0.185121  -2.494 0.012727 *  
## model_groupspassat        -0.849059   0.221980  -3.825 0.000136 ***
## model_groupspolo          -0.427292   0.218296  -1.957 0.050468 .  
## model_groupstouran        -0.384501   0.248392  -1.548 0.121822    
## model_groupstransporter   -0.268052   0.239647  -1.119 0.263503    
## model_groupstwingo         0.033515   0.264714   0.127 0.899266    
## model_groupsunknown       -0.626922   0.195369  -3.209 0.001358 ** 
## model_groupsvectra        -0.861321   0.245464  -3.509 0.000462 ***
## fuelType_groupdiesel       0.551241   0.041834  13.177  < 2e-16 ***
## fuelType_groupother        0.078709   0.067395   1.168 0.243028    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df      F p-value    
## s(kilometer)  4.956  5.835 136.47  <2e-16 ***
## s(logPowerPS) 7.532  7.930  61.57  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.693   Deviance explained = 70.4%
## GCV = 0.45962  Scale est. = 0.44347   n = 1721
gam.check(train.gam)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 2.006338e-06 .
## The Hessian was positive definite.
## Model rank =  64 / 64 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'   edf k-index p-value
## s(kilometer)  8.000 4.956   1.005    0.55
## s(logPowerPS) 8.000 7.532   0.942    0.00
# Check AIC and BIC
AIC(train.gam) # 3545
## [1] 3545.998
BIC(train.gam) # 3881.148
## [1] 3881.148
# Calculate MSE
pd <- predict(train.gam, test2.df)
test.mse <- mean((test_final$logPrice - pd)^2)
test.mse # MSE = 0.59
## [1] 0.5963492

Let \(\{Y_i : i = 1,...,n\}\) denote a set of independent random variables, where \(Y_i\) is the log-price of the \(i\)th vehicle in our sample. We assume \(Y_i \sim N(\mu_i,\sigma^2)\). Our additive model for \(Y_i\), given the predictor values \(x_i\), is

\(Y_i=\beta_0+\sum_{j=1}^{2}f_j(X_{ij})+\sum_{k}\beta_kX_{ik}+\epsilon_i\)

where each \(f_j\) is a smoothing spline with basis dimension k = 9 for one of the two continuous predictors (kilometer and logPowerPS), and each \(\beta_k\) is the coefficient of a dummy variable encoding one of the categorical predictors.

The adjusted \(R^2\) is 0.69, slightly higher than the linear model's. Evaluated against the held-out test set, the GAM gives an MSE of 0.59, the best so far. The AIC is 3546 and the BIC is 3881, both lower (better) than the corresponding linear-model scores.

The residuals appear symmetrically distributed for the GAMs model as well, clustering toward the middle of the plot. As with the original linear model, the residual plots reveal a few outliers that could impact predictive performance; for example, the model still seems to overestimate some observations at very low values of certain variables. Again, some of the outliers appear to come from incorrectly recorded entries, since they tend to have zeroes or implausibly low values in all of the continuous variables. We remain uncertain that these points are wrong, and removing them could increase the model's overall variance, so we continue to keep them, especially since some of the high-valued outliers are emphasized more in this model's residual plots.

The residuals show a pronounced departure from normality in the tails but are randomly distributed among the central values, so the QQ plot again challenges the normality assumption, even though the GAMs model has the best accuracy measurements. Since the more flexible, non-linear models achieve better MSEs than the linear ones, the response seems to have a non-linear relationship with the predictors. After comparing the models and analyzing the data, it appears the data carry a high degree of noise in the regions where the residuals depart from the reference line.

Random Forest

#attempt bagging 
bag.fit = randomForest(logPrice ~ ., data = train_final, mtry = ncol(train_final)-1, ntree = 500, importance = T)
bag.pred = predict(bag.fit, test_final)
#m-try
ncol(train_final)-1
## [1] 8
#mse
mean((test_final$logPrice - bag.pred)^2)
## [1] 0.633451
#variable importance
importance(bag.fit) 
##                     %IncMSE IncNodePurity
## vehicleType        44.21021     169.60382
## gearbox            15.16705      24.34359
## kilometer         138.69356     550.15488
## notRepairedDamage  55.66888     175.14594
## brand_groups       39.45417     152.07608
## model_groups       39.33515     186.52200
## fuelType_group     52.66028     114.45070
## logPowerPS        148.45314    1013.53883
#fit a random forest 
rf.fit = randomForest(logPrice ~ ., data = train_final, mtry = (ncol(train_final)-1)/2, ntree = 500, importance = T)
rf.pred = predict(rf.fit, test_final)
#mtry
(ncol(train_final)-1)/2
## [1] 4
#mse 
mean((test_final$logPrice - rf.pred)^2)
## [1] 0.6175276
#variable importance
importance(rf.fit)
##                     %IncMSE IncNodePurity
## vehicleType        46.51450     215.04549
## gearbox            14.71791      46.22288
## kilometer         133.99571     523.07832
## notRepairedDamage  54.40628     198.04912
## brand_groups       35.97714     172.25801
## model_groups       35.13308     179.14222
## fuelType_group     53.74471     122.31739
## logPowerPS        124.39387     872.03848
plot(rf.fit, main="Random Forest Error by Number of Trees")

plot(rf.pred, test_final$logPrice, xlab="Predicted", ylab="Actual", main="Random Forest Actual LogPrice vs. Predicted LogPrice")

To be even more thorough, we also tried trees and random forests. Because several variables are categorical, these models might capture interactions between variables that our other methods missed. Single-tree models yielded poor results and did not respond well to pruning, so we left them out of the final report. Bagging and random forests, on the other hand, produced interesting results. First we ran a random forest with mtry equal to the number of predictors (i.e., bagging), which produced a good test MSE of 0.63; the variable-importance output shows that kilometer and log horsepower were the most important variables in the model. Next we fit a random forest with mtry = 4, half of the predictors, which produced a slightly smaller MSE of 0.62 with similar variable importance.

Preferred Model

After comparing all of our models by predictive accuracy (MSE) and complexity, we decided that the GAMs model is our preferred model for predicting the log-price of cars. Our initial exploratory analysis suggested a linear model might fit best, and at first the linear regression compared favorably with ridge and lasso. Once we explored the GAMs model, however, we saw a reduction in test MSE of almost 0.07, an increase in adjusted \(R^2\) of 0.02, and lower AIC and BIC values relative to the linear model. The table below shows all of our comparison metrics side by side for the values we were able to calculate.

Model     MSE     AIC/BIC      \(R^2\)
Linear    0.663   3635/3913    0.67
Ridge     0.662   n/a          0.565
Lasso     0.668   n/a          0.561
GAMs      0.596   3546/3881    0.69
Bagging   0.633   n/a          n/a
RF        0.618   n/a          n/a

After testing the data across various models, the noise reflected in our accuracy measurements may stem from the true model's variance rather than from a flaw in the model selection process. That said, GAMs is the clear winner among our options: it has decent predictive power, explains the most variance in the data, and is less complex than the other models. The residual plots suggest it may still be missing an important element toward the lower tail, since we appear to overestimate the lower prices. A predictor capturing overall vehicle quality might recover part of the human decision behind the prices sellers set online. We might also incorporate the yearOfRegistration variable in some form, either linearly or as a categorical “old”/“new”/“middle-age” factor; our models assumed part of a car's age would be reflected in the kilometers driven, but that may not be enough. We could make these changes in the future, but we cannot account for the many irregularities introduced by sellers who either do not list the car's actual price (“low-ball” listings) or are not actually selling cars. On websites like Ebay or Craigslist, it is hard to separate the serious listings from the advertising gimmicks.
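The “old”/“new”/“middle-age” idea could be prototyped with base R's cut(); the break years below are illustrative placeholders, not tuned choices:

```r
# Hypothetical binning of yearOfRegistration into an age-group factor;
# the cut points (1995 and 2005) are placeholders.
year <- c(1990, 1999, 2003, 2010, 2016)
age_group <- cut(year,
                 breaks = c(-Inf, 1995, 2005, Inf),
                 labels = c("old", "middle-age", "new"))
table(age_group)
```

The resulting factor could then be added to train_final and test_final like the other categorical predictors.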